Stream Clustering using Probabilistic Data Structures
نویسنده
چکیده
Most density based stream clustering algorithms separate the clustering process into an online and offline component. Exact summarized statistics are being employed for defining micro-clusters or grid cells during the online stage followed by macro-clustering during the offline stage. This paper proposes a novel alternative to the traditional two phase stream clustering scheme, introducing sketch-based data structures for assessing both stream density and cluster membership with probabilistic accuracy guarantees. A count-min sketch using a damped window model estimates stream density. Bloom filters employing a variation of activeactive buffering estimate cluster membership. Instances of both types of sketches share the same set of hash functions. The resulting stream clustering algorithm is capable of detecting arbitrarily shaped clusters while correctly handling outliers and making no assumption on the total number of clusters. Experimental results over a number of real and synthetic datasets illustrate the proposed algorithm quality and efficiency.
منابع مشابه
Application of Probabilistic Clustering Algorithms to Determine Mineralization Areas in Regional-Scale Exploration Studies
In this work, we aim to identify the mineralization areas for the next exploration phases. Thus, the probabilistic clustering algorithms due to the use of appropriate measures, the possibility of working with datasets with missing values, and the lack of trapping in local optimal are used to determine the multi-element geochemical anomalies. Four probabilistic clustering algorithms, namely PHC,...
متن کاملPROBI: A Heuristic for the probabilistic k-median problem
We develop the heuristic PROBI for the probabilistic Euclidean k-median problem based on a coreset construction by Lammersen et al. [28]. Our algorithm computes a summary of the data and then uses an adapted version of k-means++ [5] to compute a good solution on the summary. The summary is maintained in a data stream, so PROBI can be used in a data stream setting on very large data sets. We exp...
متن کاملPersian Handwritten Digit Recognition Using Particle Swarm Probabilistic Neural Network
Handwritten digit recognition can be categorized as a classification problem. Probabilistic Neural Network (PNN) is one of the most effective and useful classifiers, which works based on Bayesian rule. In this paper, in order to recognize Persian (Farsi) handwritten digit recognition, a combination of intelligent clustering method and PNN has been utilized. Hoda database, which includes 80000 P...
متن کاملA Novel Clustering Algorithm For Application To Large Probabilistic Tractography Data Sets
Introduction The segmentation of brain white matter through clustering of tractography data is a problem that has attracted considerable attention in recent years. Approaches to date have succeeded in identifying the major white matter structures from diffusion tensor tractography, but suffer from a number of limitations, including the dependence upon definition of regions of interest, use of a...
متن کاملIncremental Clustering for the Classification of Concept-Drifting Data Streams
Concept drift is a common phenomenon in streaming data environments and constitutes an interesting challenge for researchers in the machine learning and data mining community. This paper proposes a probabilistic representation model for data stream classification and investigates the use of incremental clustering algorithms in order to identify and adapt to concept drift. An experimental study ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1612.02701 شماره
صفحات -
تاریخ انتشار 2016